Abstract. Computation on compressed strings is one of the key approaches to
processing massive data sets. We consider local subsequence recognition
problems on strings compressed by straight-line programs (SLP), a compression
scheme closely related to Lempel-Ziv compression. For an SLP-compressed text
of length $\bar m$, and an uncompressed pattern of length $n$, Cegielski et al.
gave an algorithm for local subsequence recognition running in time
$O(\bar m n^2 \log n)$. We improve the running time to $O(\bar m n^{1.5})$.
Our algorithm can also be used to compute the longest common subsequence
between a compressed text and an uncompressed pattern in time
$O(\bar m n^{1.5})$; the same problem with a compressed pattern is known to be
NP-hard.

1 Introduction 

Computation on compressed strings is one of the key approaches to processing 
massive data sets. It has long been known that certain algorithmic problems can 
be solved directly on a compressed string, without first decompressing it; see 
[4, 11] for references.

One of the most general string compression methods is compression by
straight-line programs (SLP) [16]. In particular, SLP compression captures the
well-known LZ and LZW algorithms [21, 22, 20]. Various pattern matching problems
on SLP-compressed strings have been studied; see e.g. [I] for references.
Cegielski et al. [3] considered subsequence recognition problems on
SLP-compressed strings. For an SLP-compressed text of length $\bar m$, and an
uncompressed pattern of length $n$, they gave several algorithms for global and
local subsequence recognition, running in time $O(\bar m n^2 \log n)$.

In this paper, we improve on the results of [3] as follows. First, we describe
a simple folklore algorithm for global subsequence recognition on an
SLP-compressed text, running in time $O(\bar m n)$. Then, we consider the more
general partial semi-local longest common subsequence (LCS) problem, which
consists in computing implicitly the LCS between the compressed text and every
substring of the uncompressed pattern. The same problem with a compressed
pattern is known to be NP-hard. For the partial semi-local LCS problem, we
propose a new algorithm, running in time $O(\bar m n^{1.5})$. Our algorithm is
based on the partial highest-score matrix multiplication technique presented in
[18]. We then extend this method to the several versions of local subsequence
recognition considered in [3], for each obtaining an algorithm running in the
same asymptotic time $O(\bar m n^{1.5})$.

This paper is a sequel to papers [17, 18]; we recall most of their relevant
material here for completeness.

2 Subsequences in compressed text 

We consider strings of characters from a fixed finite alphabet, denoting string
concatenation by juxtaposition. Given a string, we distinguish between its
contiguous substrings, and not necessarily contiguous subsequences. Special
cases of a substring are a prefix and a suffix of a string. Given a string $a$
of length $m$, we use the take/drop notation of [15],
$a \upharpoonleft k$, $a \downharpoonleft k$, $a \upharpoonright k$,
$a \downharpoonright k$, to denote respectively its prefix of length $k$,
suffix of length $m - k$, suffix of length $k$, and prefix of length $m - k$.
For two strings $a = \alpha_1 \alpha_2 \ldots \alpha_m$ and
$b = \beta_1 \beta_2 \ldots \beta_n$ of lengths $m$, $n$ respectively, the
longest common subsequence (LCS) problem consists in computing the length of
the longest string that is a subsequence of both $a$ and $b$. We will call this
length the LCS score of the strings.

Let $T$ be a string of length $m$ (typically large). String $T$ will be
represented implicitly by a straight-line program (SLP) of length $\bar m$,
which is a sequence of $\bar m$ statements. Each statement $r$,
$1 \le r \le \bar m$, has either the form $T_r = \alpha$, where $\alpha$ is an
alphabet character, or the form $T_r = T_s T_t$, where $1 \le s, t < r$. We
identify every symbol $T_r$ with the string it represents; in particular, we
have $T = T_{\bar m}$. Note that $m \ge \bar m$, and that in general the
uncompressed text length $m$ can be exponential in the SLP-compressed length
$\bar m$.
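The SLP definition above can be illustrated by a minimal Python sketch. The list-based encoding, the 0-based statement indices, and the helper names `slp_lengths`, `expand` and `doubling_slp` are assumptions made for this illustration; they are not part of the paper. The sketch computes the uncompressed length of every symbol without decompressing, and shows an SLP whose text length is exponential in its number of statements.

```python
from typing import List, Tuple, Union

# Assumed encoding: a statement is either a single character (T_r = alpha)
# or a pair (s, t) of 0-based indices of earlier statements (T_r = T_s T_t).
Statement = Union[str, Tuple[int, int]]


def slp_lengths(slp: List[Statement]) -> List[int]:
    """Uncompressed length of every symbol, computed without decompression."""
    lengths: List[int] = []
    for r, stmt in enumerate(slp):
        if isinstance(stmt, str):
            lengths.append(1)
        else:
            s, t = stmt
            assert 0 <= s < r and 0 <= t < r, "only earlier statements may be used"
            lengths.append(lengths[s] + lengths[t])
    return lengths


def expand(slp: List[Statement]) -> str:
    """Decompress the last symbol; intended for small examples only."""
    strings: List[str] = []
    for stmt in slp:
        if isinstance(stmt, str):
            strings.append(stmt)
        else:
            s, t = stmt
            strings.append(strings[s] + strings[t])
    return strings[-1]


def doubling_slp(k: int) -> List[Statement]:
    """SLP with k + 1 statements representing the string 'a' repeated 2**k times."""
    return ["a"] + [(r, r) for r in range(k)]
```

With this encoding, `doubling_slp(40)` has 41 statements but represents a text of length 2^40, which is why the algorithms in this paper avoid decompression.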

Our goal is to design efficient algorithms on SLP-compressed texts. While we 
do not allow text decompression (since, in the worst case, this could be extremely 
inefficient), we will assume that standard arithmetic operations on integers up 
to m can be performed in constant time. This assumption is necessary, since the 
counting version of our problem produces a numerical output that may be as high
as $O(m)$. The same assumption on the computation model is made implicitly in

The LCS problem on uncompressed strings is a classical problem; see e.g. [7, 9]
for the background and references. Given input strings of lengths $m$, $n$, the
LCS problem can be solved in time $O\bigl(\frac{mn}{\log(m+n)}\bigr)$, assuming
$m$ and $n$ are reasonably close [12, 6]. The LCS problem on two SLP-compressed
strings is considered in [11], and proven to be NP-hard. In this paper, we
consider the LCS problem on two input strings, one of which is SLP-compressed
and the other uncompressed. This problem can be regarded as a special case of
computing the edit distance between a context-free language given by a grammar
of size $\bar m$, and a string of size $n$. For this more general problem,
Myers [13] gives an algorithm running in time
$O(\bar m n^3 + \bar m \log \bar m \cdot n^2)$.

From now on, we will assume that string $T$ (the text string) of length $m$ is
represented by an SLP of length $\bar m$, and that string $P$ (the pattern
string) of length $n$ is represented explicitly. Following [4, 11], we study
the problem of recognising in $T$ subsequences identical to $P$, which is
closely related to the LCS problem.

Definition 1. The (global) subsequence recognition problem consists in decid- 
ing whether string T contains string P as a subsequence. 

The subsequence recognition problem on uncompressed strings is a classical
problem, considered e.g. in [1] as the "subsequence matching problem". The
subsequence recognition problem on an SLP-compressed text is considered in [3]
as Problem 1, with an algorithm running in time $O(\bar m n^2 \log n)$.

In addition to global subsequence recognition, it is useful to consider text
subsequences locally, i.e. in substrings of $T$. In this context, we will call
the substrings of $T$ windows. We will say that string $a$ contains string $b$
minimally as a subsequence, if $b$ is a subsequence in $a$, but not in any
proper substring of $a$. Even with this restriction on subsequence containment,
the number of substrings in $T$ containing $P$ minimally as a subsequence may
be as high as $O(m)$, so just listing them all may require time exponential in
$\bar m$. The same is true if, instead of minimal substrings, we consider all
substrings of $T$ of a fixed length. Therefore, it is sensible to define local
subsequence recognition as a family of counting problems.

Definition 2. The minimal-window subsequence recognition problem consists in
counting the number of windows in string $T$ containing string $P$ minimally as
a subsequence.

Definition 3. The fixed-window subsequence recognition problem consists in
counting the number of windows of a given length $w$ in string $T$ containing
string $P$ as a subsequence.
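To make Definitions 2 and 3 concrete, here is a brute-force Python sketch on uncompressed strings. It is only a reference for the problem statements, not one of the algorithms of this paper; the function names are chosen for this illustration, and a non-empty pattern is assumed. Minimality is tested by removing one character from either end of the window, which suffices because subsequence containment is monotone under taking superstrings.

```python
def is_subsequence(p: str, t: str) -> bool:
    """True if p is a (not necessarily contiguous) subsequence of t."""
    it = iter(t)
    return all(c in it for c in p)  # each 'in' consumes the iterator up to a match


def count_minimal_windows(t: str, p: str) -> int:
    """Number of windows t[i:j] containing p minimally as a subsequence."""
    count = 0
    for i in range(len(t)):
        for j in range(i + 1, len(t) + 1):
            if (is_subsequence(p, t[i:j])
                    and not is_subsequence(p, t[i + 1:j])
                    and not is_subsequence(p, t[i:j - 1])):
                count += 1
    return count


def count_fixed_windows(t: str, p: str, w: int) -> int:
    """Number of windows of length w in t containing p as a subsequence."""
    return sum(is_subsequence(p, t[i:i + w]) for i in range(len(t) - w + 1))
```

For example, in the text "abcab" the pattern "ab" is contained minimally in the two windows "ab", and it is contained in two of the three windows of length 3 ("abc" and "cab").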

The minimal-window and fixed-window subsequence recognition problems on
uncompressed strings are considered in [H] as "episode matching problems" (see
also [5] and references therein). The same problems on an SLP-compressed text
and an uncompressed pattern are considered in [3] as Problems 2, 3 (a special
case of 2) and 4. Additionally, the same paper considers the bounded
minimal-window subsequence recognition problem (counting the number of windows
in $T$ of length at most $w$ containing $P$ minimally as a subsequence) as
Problem 5. For all these problems, paper [3] gives algorithms running in time
$O(\bar m n^2 \log n)$.

3 Semi-local longest common subsequences 

In this section and the next, we recall the algorithmic framework developed
in [17, 18]. This framework is subsequently used to solve the compressed
subsequence recognition problems introduced in the previous section.
In [17], we introduced the following problem.

Definition 4. The all semi-local LCS problem consists in computing the LCS 
scores on substrings of strings a and b as follows: 



• the all string-substring LCS problem: a against every substring of b;

• the all prefix-suffix LCS problem: every prefix of a against every suffix of b;

• symmetrically, the all substring-string LCS problem and the all suffix-prefix
LCS problem, defined as above but with the roles of a and b exchanged.

It turns out that this is a very natural and useful generalisation of the LCS 
problem. 

In addition to standard integer indices $\ldots, -2, -1, 0, 1, 2, \ldots$, we
use odd half-integer indices
$\ldots, -\tfrac52, -\tfrac32, -\tfrac12, \tfrac12, \tfrac32, \tfrac52, \ldots$.
For two numbers $i$, $j$, we write $i \trianglelefteq j$ if
$j - i \in \{0, 1\}$, and $i \lessdot j$ if $j - i = 1$. We denote

$$[i : j] = \{i, i+1, \ldots, j-1, j\} \qquad (i : j) = \{i + \tfrac12, i + \tfrac32, \ldots, j - \tfrac32, j - \tfrac12\}$$

To denote infinite intervals of integers and odd half-integers, we will use
$-\infty$ for $i$ and $+\infty$ for $j$ where appropriate. For both interval
types $[i : j]$ and $(i : j)$, we call the difference $j - i$ interval length.

We will make extensive use of finite and infinite matrices, with integer
elements and integer or odd half-integer indices. A permutation matrix is a
(0,1)-matrix containing exactly one nonzero in every row and every column. An
identity matrix is a permutation matrix $I$, such that $I(i, j) = 1$ if
$i = j$, and $I(i, j) = 0$ otherwise. Each of these definitions applies to both
finite and infinite matrices.

From now on, instead of "index pairs corresponding to nonzeros", we will write
simply "nonzeros", where this does not lead to confusion. A finite permutation
matrix can be represented by its nonzeros. When we deal with an infinite
matrix, it will typically have a finite non-trivial core, and will be trivial
(e.g. equal to an infinite identity matrix) outside of this core. An infinite
permutation matrix with finite non-trivial core can be represented by its core
nonzeros.

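Let $D$ be an arbitrary numerical matrix with indices ranging over $(0 : n)$.
Its distribution matrix, with indices ranging over $[0 : n]$, is defined by

$$d(i_0, j_0) = \sum_{i \in (i_0 : n),\ j \in (0 : j_0)} D(i, j)$$

for all $i_0, j_0 \in [0 : n]$. We have

$$D(i, j) = d(i - \tfrac12, j + \tfrac12) - d(i - \tfrac12, j - \tfrac12) - d(i + \tfrac12, j + \tfrac12) + d(i + \tfrac12, j - \tfrac12)$$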



When matrix d is a distribution matrix of D, matrix D is called the density 
matrix of d. The definitions of distribution and density matrices extend natu- 
rally to infinite matrices. We will only deal with distribution matrices where all 
elements are defined and finite. 

We will use the term permutation-distribution matrix as an abbreviation of 
"distribution matrix of a permutation matrix" . 

4 Algorithmic techniques 

The rest of this paper is based on the framework for the all semi-local LCS
problem developed in [17, 18]. For completeness, we recall most background
definitions and results from [17], omitting the proofs.





Fig. 1. An alignment dag and a highest-scoring path

4.1 Dominance counting 

It is well-known that an instance of the LCS problem can be represented by a
dag (directed acyclic graph) on an $m \times n$ grid of nodes, where character
matches correspond to edges scoring 1, and mismatches to edges scoring 0.

Definition 5. Let $m, n \in \mathbb{N}$. An alignment dag $G$ is a weighted
dag, defined on the set of nodes $v_{l,i}$, $l \in [0 : m]$, $i \in [0 : n]$.
The edge and path weights are called scores. For all $l \in [1 : m]$,
$i \in [1 : n]$:

• horizontal edge $v_{l,i-1} \to v_{l,i}$ and vertical edge
$v_{l-1,i} \to v_{l,i}$ are both always present in $G$ and have score 0;

• diagonal edge $v_{l-1,i-1} \to v_{l,i}$ may or may not be present in $G$; if
present, it has score 1.

Given an instance of the all semi-local LCS problem, its corresponding alignment
dag is an $m \times n$ alignment dag, where the diagonal edge
$v_{l-1,i-1} \to v_{l,i}$ is present, iff $\alpha_l = \beta_i$.

Figure 1 shows the alignment dag corresponding to strings a = "baabcbca",
b = "baabcabcabaca" (an example borrowed from [2]).

Common string-substring, suffix-prefix, prefix-suffix, and substring-string
subsequences correspond, respectively, to paths of the following form in the
alignment dag:

$$v_{0,i} \leadsto v_{m,i'}, \quad v_{l,0} \leadsto v_{m,i'}, \quad v_{0,i} \leadsto v_{l',n}, \quad v_{l,0} \leadsto v_{l',n} \qquad (1)$$

where $l, l' \in [0 : m]$, $i, i' \in [0 : n]$. The length of each subsequence
is equal to the score of its corresponding path.

The solution to the all semi-local LCS problem is equivalent to finding the
score of a highest-scoring path of each of the four types (1) between every
possible pair of endpoints.

To describe our algorithms, we need to modify the definition of the alignment
dag by embedding the finite grid of nodes into an infinite grid.



Definition 6. Given an $m \times n$ alignment dag $G$, its extension $G^+$ is
an infinite weighted dag, defined on the set of nodes $v_{l,i}$,
$l, i \in [-\infty : +\infty]$ and containing $G$ as a subgraph. For all
$l, i \in [-\infty : +\infty]$:

• horizontal edge $v_{l,i-1} \to v_{l,i}$ and vertical edge
$v_{l-1,i} \to v_{l,i}$ are both always present in $G^+$ and have score 0;

• when $l \in [1 : m]$, $i \in [1 : n]$, diagonal edge
$v_{l-1,i-1} \to v_{l,i}$ is present in $G^+$ iff it is present in $G$; if
present, it has score 1;

• otherwise, diagonal edge $v_{l-1,i-1} \to v_{l,i}$ is always present in
$G^+$ and has score 1.

An infinite dag that is an extension of some (finite) alignment dag will be
called an extended alignment dag. When dag $G^+$ is the extension of dag $G$,
we will say that $G$ is the core of $G^+$. Relative to $G^+$, we will call the
nodes of $G$ core nodes.

By using the extended alignment dag representation, the four path types can be
reduced to a single type, corresponding to the all string-substring (or,
symmetrically, substring-string) LCS problem on an extended set of indices.

Definition 7. Given an $m \times n$ alignment dag $G$, its extended
highest-score matrix is an infinite matrix defined by

$$A(i, j) = \max \mathrm{score}(v_{0,i} \leadsto v_{m,j}) \qquad i, j \in [-\infty : +\infty] \qquad (2)$$

where the maximum is taken across all paths between the given endpoints in the
extension $G^+$. If $i = j$, we have $A(i, j) = 0$. By convention, if $j < i$,
then we let $A(i, j) = j - i < 0$.

In Figure 1, the highlighted path has score 5, and corresponds to the value
$A(4, 11) = 5$, which is equal to the LCS score of string a and substring
b' = "cabcaba".

In this paper, we will deal almost exclusively with extended (i.e. finitely
represented, but conceptually infinite) alignment dags and highest-score
matrices. From now on, we omit the term "extended" for brevity, always assuming
it by default.

The maximum path scores for each of the four path types (1) can be obtained
from the highest-score matrix $A$ as follows:

$$\max \mathrm{score}(v_{0,j} \leadsto v_{m,j'}) = A(j, j')$$
$$\max \mathrm{score}(v_{i,0} \leadsto v_{m,j'}) = A(-i, j') - i$$
$$\max \mathrm{score}(v_{0,j} \leadsto v_{i',n}) = A(j, m + n - i') - m + i'$$
$$\max \mathrm{score}(v_{i,0} \leadsto v_{i',n}) = A(-i, m + n - i') - m - i + i'$$

where $i, i' \in [0 : m]$, $j, j' \in [0 : n]$, and the maximum is taken across
all paths between the given endpoints.



Fig. 2. An alignment dag and the seaweeds

Definition 8. An odd half-integer point $(i, j) \in (-\infty : +\infty)^2$ is
called A-critical, if

$$A(i + \tfrac12, j - \tfrac12) \lessdot A(i - \tfrac12, j - \tfrac12) = A(i + \tfrac12, j + \tfrac12) = A(i - \tfrac12, j + \tfrac12)$$

In particular, point $(i, j)$ is never A-critical for $i > j$. When $i = j$,
point $(i, j)$ is A-critical iff $A(i - \tfrac12, j + \tfrac12) = 0$.

Corollary 1. Let $i, j \in (-\infty : +\infty)$. For each $i$ (respectively,
$j$), there exists exactly one $j$ (respectively, $i$) such that the point
$(i, j)$ is A-critical.

Figure 2 shows the alignment dag of Figure 1 along with the critical points.
In particular, every critical point $(i, j)$, where $i, j \in (0 : n)$, is
represented by a seaweed¹ originating between the nodes $v_{0,i-\frac12}$ and
$v_{0,i+\frac12}$, and terminating between the nodes $v_{m,j-\frac12}$ and
$v_{m,j+\frac12}$. The remaining seaweeds, originating or terminating at the
sides of the dag, correspond to critical points $(i, j)$, where either
$i \in (-m : 0)$ or $j \in (n : n + m)$ (or both). In particular, every
critical point $(i, j)$, where $i \in (-m : 0)$ (respectively,
$j \in (n : m + n)$) is represented by a seaweed originating between the nodes
$v_{-i-\frac12,0}$ and $v_{-i+\frac12,0}$ (respectively, terminating between
the nodes $v_{m+n-j-\frac12,n}$ and $v_{m+n-j+\frac12,n}$).



It is convenient to consider the set of A-critical points as an infinite
permutation matrix. For all $i, j \in (-\infty : +\infty)$, we define

$$D_A(i, j) = \begin{cases} 1 & \text{if } (i, j) \text{ is } A\text{-critical} \\ 0 & \text{otherwise} \end{cases}$$

We denote the infinite distribution matrix of $D_A$ by $d_A$, and consider the
following simple geometric relation.

Definition 9. Point $(i_0, j_0)$ dominates² point $(i, j)$, if $i_0 < i$ and
$j < j_0$.

1 This imaginative term was suggested by Yu. V. Matiyasevich.

2 The standard definition of dominance requires $i < i_0$ instead of
$i_0 < i$. Our definition is more convenient in the context of the LCS problem.



Informally, the dominated point is "below and to the left" of the dominating
point in the highest-score matrix³. Clearly, for an arbitrary integer point
$(i_0, j_0) \in [-\infty : +\infty]^2$, the value $d_A(i_0, j_0)$ is the number
of (odd half-integer) A-critical points it dominates.

The following theorem shows that the set of critical points defines uniquely
a highest-score matrix, and gives a simple formula for recovering the matrix
elements.

Theorem 1. For all $i_0, j_0 \in [-\infty : +\infty]$, we have

$$A(i_0, j_0) = j_0 - i_0 - d_A(i_0, j_0)$$

In Figure 2, critical points dominated by point (4, 11) are represented by
seaweeds whose both endpoints (and therefore the whole seaweed) fit between the
two vertical lines, corresponding to index values $i = 4$ and $j = 11$. Note
that there are exactly two such seaweeds, and that
$A(4, 11) = 11 - 4 - 2 = 5$.
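The relation between critical points and matrix elements can be checked numerically on the running example. The following Python sketch is an illustration under simplifying assumptions: it only tabulates $A(i, j)$ for $i, j \in [0 : n]$ (where, for $i \le j$, the value equals the LCS score of a against the substring of b between positions i and j), it represents the half-integer point $(i + \tfrac12, j + \tfrac12)$ by the integer pair `(i, j)`, and it uses a plain quadratic LCS routine rather than any technique from this paper. All function names are hypothetical.

```python
def lcs_score(a: str, b: str) -> int:
    """Classical dynamic-programming LCS score."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def highest_score_matrix(a: str, b: str):
    """A(i, j) for i, j in [0 : n]; by convention A(i, j) = j - i when j < i."""
    n = len(b)
    return [[lcs_score(a, b[i:j]) if i <= j else j - i for j in range(n + 1)]
            for i in range(n + 1)]


def critical_points(A):
    """Pairs (i, j) standing for critical points (i + 1/2, j + 1/2) in (0 : n)^2.

    Uses the density formula applied to d_A(i, j) = j - i - A(i, j).
    """
    n = len(A) - 1
    return [(i, j) for i in range(n) for j in range(n)
            if A[i][j] + A[i + 1][j + 1] - A[i][j + 1] - A[i + 1][j] == 1]


def dominated(points, i0: int, j0: int) -> int:
    """Number of points (i + 1/2, j + 1/2) with i0 < i + 1/2 and j + 1/2 < j0."""
    return sum(1 for (i, j) in points if i >= i0 and j < j0)


a, b = "baabcbca", "baabcabcabaca"
A = highest_score_matrix(a, b)
points = critical_points(A)
```

Since a critical point $(i, j)$ always has $i \le j$, every point dominated by an integer point in $[0 : n]^2$ lies in $(0 : n)^2$, so the restricted table is enough to evaluate the formula of Theorem 1 for such queries, including the value $A(4, 11)$ discussed above.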

By Theorem 1, a highest-score matrix $A$ is represented uniquely by an infinite
permutation matrix $D_A$ with odd half-integer row and column indices. We will
call matrix $D_A$ the implicit representation of $A$. From now on, we will
refer to the critical points of $A$ as nonzeros (i.e., ones) in its implicit
representation.

Recall that outside the core, the structure of an alignment graph is trivial:
all possible diagonal edges are present in the off-core subgraph. This property
carries over to the corresponding permutation matrix.

Definition 10. Given an infinite permutation matrix $D$, its core is a square
(possibly semi-infinite) submatrix defined by the index range
$(i_0 : i_1) \times (j_0 : j_1)$, where $j_0 - i_0 = j_1 - i_1$ (as long as
both these values are defined), and such that for all off-core elements
$D(i, j)$, we have $D(i, j) = 1$ iff $j - i = j_0 - i_0$ and
$j - i = j_1 - i_1$ (in each case, as long as the right-hand side is defined).

Informally, the off-core part of matrix $D$ has nonzeros on the off-core
extension of the core's main diagonal.

The following statements are an immediate consequence of the definitions. 

Corollary 2. A core of an infinite permutation matrix is a (possibly semi- 
infinite) permutation matrix. 

Corollary 3. Given an $m \times n$ alignment dag with highest-score matrix $A$
as described above, the corresponding permutation matrix $D_A$ has core of size
$m + n$, defined by $i \in (-m : n)$, $j \in (0 : m + n)$.

In Figure 2, the set of critical points represented by the seaweeds corresponds
precisely to the set of all core nonzeros in $D_A$. Note that there are
$m + n = 8 + 13 = 21$ seaweeds in total.

3 Note that these concepts of "below" and "left" are relative to the highest-score 
matrix, and have no connection to the "vertical" and "horizontal" directions in the 
alignment dag. 



Since only core nonzeros need to be represented explicitly, the implicit
representation of a highest-score matrix can be stored as a permutation of size
$m + n$. From now on, we will assume this as the default representation of such
matrices.

By Theorem 1, the value $A(i_0, j_0)$ is determined by the number of nonzeros in
$D_A$ dominated by $(i_0, j_0)$. Therefore, an individual element of $A$ can be
obtained explicitly by scanning the implicit representation of $A$ in time
$O(m + n)$, counting the dominated nonzeros. However, existing methods of
computational geometry allow us to perform this dominance counting procedure
much more efficiently, as long as preprocessing of the implicit representation
is allowed.

Theorem 2. Given the implicit representation $D_A$ of a highest-score matrix
$A$, there exists a data structure which

• has size $O((m + n) \log(m + n))$;

• can be built in time $O((m + n) \log(m + n))$;

• allows to query an individual element of $A$ in time $O(\log^2(m + n))$.

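One standard way to meet bounds of this shape is a merge-sort tree, a simple form of 2D range tree. The Python sketch below is an assumption about how such a structure could look, not the construction used in the cited work; the class name `DominanceCounter` and its interface are hypothetical. Points are sorted by first coordinate, every tree node stores the sorted second coordinates of its points, and a query combines a logarithmic number of nodes with one binary search each.

```python
from bisect import bisect_left, bisect_right
from heapq import merge


class DominanceCounter:
    """Merge-sort tree over points (i, j) for dominance counting queries."""

    def __init__(self, points):
        pts = sorted(points)                       # order by first coordinate
        self.firsts = [i for i, _ in pts]
        self.n = len(pts)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        # tree[node] holds the sorted second coordinates of the node's points
        self.tree = [[] for _ in range(2 * self.size)]
        for pos, (_, j) in enumerate(pts):
            self.tree[self.size + pos] = [j]
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = list(merge(self.tree[2 * node],
                                         self.tree[2 * node + 1]))

    def count(self, i0, j0) -> int:
        """Number of stored points (i, j) with i0 < i and j < j0."""
        lo = bisect_right(self.firsts, i0) + self.size   # first point with i > i0
        hi = self.n + self.size
        total = 0
        while lo < hi:                                   # O(log n) canonical nodes
            if lo & 1:
                total += bisect_left(self.tree[lo], j0)
                lo += 1
            if hi & 1:
                hi -= 1
                total += bisect_left(self.tree[hi], j0)
            lo //= 2
            hi //= 2
        return total
```

With the nonzeros of $D_A$ stored in such a structure, an element $A(i_0, j_0)$ would be obtained as `j0 - i0 - counter.count(i0, j0)`, following Theorem 1.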
4.2 Highest-score matrix multiplication 

A common pattern in many problems on strings is partitioning the alignment
dag into alignment subdags. Without loss of generality, consider a partitioning
of an $(M + m) \times n$ alignment dag $G$ into an $M \times n$ alignment dag
$G_1$ and an $m \times n$ alignment dag $G_2$, where $M \ge m$. The dags $G_1$,
$G_2$ share a horizontal row of $n$ nodes, which is simultaneously the bottom
row of $G_1$ and the top row of $G_2$; the dags also share the corresponding
$n - 1$ horizontal edges. We will say that dag $G$ is the concatenation of dags
$G_1$ and $G_2$. Let $A$, $B$, $C$ denote the highest-score matrices defined
respectively by dags $G_1$, $G_2$, $G$. Our goal is, given matrices $A$, $B$,
to compute matrix $C$ efficiently. We call this procedure highest-score matrix
multiplication.

Definition 11. Let $n \in \mathbb{N}$. Let $A$, $B$, $C$ be arbitrary numerical
matrices with indices ranging over $[0 : n]$. The (min, +)-product
$A \odot B = C$ is defined by

$$C(i, k) = \min_j \bigl(A(i, j) + B(j, k)\bigr) \qquad i, j, k \in [0 : n]$$

Lemma 1 ([17]). Let $D_A$, $D_B$, $D_C$ be permutation matrices with indices
ranging over $(0 : n)$, and let $d_A$, $d_B$, $d_C$ be their respective
distribution matrices. Let $d_A \odot d_B = d_C$. Given the nonzeros of $D_A$,
$D_B$, the nonzeros of $D_C$ can be computed in time $O(n^{1.5})$ and memory
$O(n)$.

Lemma 2 ([17]). Let $D_A$, $D_B$, $D_C$ be permutation matrices with indices
ranging over $(-\infty : +\infty)$. Let $D_A$ (respectively, $D_B$) have
semi-infinite core $(0 : +\infty)^2$ (respectively, $(-\infty : n)^2$). Let
$d_A$, $d_B$, $d_C$ be the respective distribution matrices, and assume
$d_A \odot d_B = d_C$. We have

$$D_A(i, j) = D_C(i, j) \quad \text{for } i \in (-\infty : +\infty),\ j \in (n : +\infty) \qquad (3)$$
$$D_B(j, k) = D_C(j, k) \quad \text{for } j \in (-\infty : 0),\ k \in (-\infty : +\infty) \qquad (4)$$

Equations (3)-(4) cover all but $n$ nonzeros in each of $D_A$, $D_B$, $D_C$.
These remaining nonzeros have $i \in (0 : +\infty)$, $j \in (0 : n)$,
$k \in (-\infty : n)$. Given the $n$ remaining nonzeros in each of $D_A$,
$D_B$, the $n$ remaining nonzeros in $D_C$ can be computed in time
$O(n^{1.5})$ and memory $O(n)$.

Fig. 3. An illustration of Lemma 2

The above lemma is illustrated by Figure 3. Three horizontal lines represent
respectively the index ranges of $i$, $j$, $k$. The nonzeros in $D_A$
(respectively, $D_B$) are shown by top-to-middle (respectively,
middle-to-bottom) seaweeds; thin seaweeds correspond to the nonzeros covered by
(3)-(4), and thick seaweeds to the remaining nonzeros. By Lemma 2, the nonzeros
in $D_C$ covered by (3)-(4) are represented by thin top-to-bottom seaweeds. The
remaining nonzeros in $D_C$ are not represented explicitly, but can be obtained
from the thick top-to-middle and middle-to-bottom seaweeds by Lemma 1.
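For small finite instances, the (min, +)-product of Definition 11 on permutation-distribution matrices can be evaluated directly from the definitions. The Python sketch below does exactly that in cubic time; it is emphatically not the $O(n^{1.5})$ procedure of Lemma 1, only a naive reference that may help when checking an implementation. The encoding (a list `perm` with `perm[i] = j` standing for a nonzero at $(i + \tfrac12, j + \tfrac12)$) and all function names are assumptions of this illustration.

```python
def distribution(perm):
    """d(i0, j0) = number of nonzeros (i + 1/2, perm[i] + 1/2) with
    i0 < i + 1/2 and perm[i] + 1/2 < j0, for i0, j0 in [0 : n]."""
    n = len(perm)
    return [[sum(1 for i in range(i0, n) if perm[i] < j0)
             for j0 in range(n + 1)] for i0 in range(n + 1)]


def min_plus(A, B):
    """Naive (min, +)-product: C(i, k) = min over j of A(i, j) + B(j, k)."""
    size = len(A)
    return [[min(A[i][j] + B[j][k] for j in range(size))
             for k in range(size)] for i in range(size)]


def density(d):
    """Density matrix of d; entry [i][j] stands for D(i + 1/2, j + 1/2)."""
    n = len(d) - 1
    return [[d[i][j + 1] - d[i][j] - d[i + 1][j + 1] + d[i + 1][j]
             for j in range(n)] for i in range(n)]


def is_permutation_matrix(D) -> bool:
    """True if D is a square (0,1)-matrix with one nonzero per row and column."""
    n = len(D)
    entries_ok = all(x in (0, 1) for row in D for x in row)
    rows_ok = all(sum(row) == 1 for row in D)
    cols_ok = all(sum(D[i][j] for i in range(n)) == 1 for j in range(n))
    return entries_ok and rows_ok and cols_ok
```

As a usage example, `density(min_plus(distribution(p), distribution(q)))` recovers the nonzeros of $D_C$ from two permutations `p` and `q` of equal size; multiplying by the distribution matrix of the identity permutation leaves the other factor unchanged.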



4.3 Partial highest-score matrix multiplication 

In certain contexts, e.g. when $m \gg n$, we may not be able to solve the all
semi-local LCS problem, or even to store its implicit highest-score matrix. In
such cases, we may wish to settle for the following asymmetric version of the
problem.

Definition 12. The partial semi-local LCS problem consists in computing the
LCS scores on substrings of a and b as follows:

• the all string-substring LCS problem: a against every substring of b;

• the all prefix-suffix LCS problem: every prefix of a against every suffix of b;

• the all suffix-prefix LCS problem: every suffix of a against every prefix of b.

In contrast with the all semi-local LCS problem, the comparison of substrings of
a against b is not required.

Let $A$ be the highest-score matrix for the all semi-local LCS problem. Given
an implicit representation of $A$, the corresponding partial implicit
representation consists of all nonzeros $D_A(i, j)$, where either
$i \in (0 : n)$, or $j \in (0 : n)$ (equivalently,
$(i, j) \in (0 : n) \times (0 : +\infty) \cup (-\infty : n) \times (0 : n)$).
All such nonzeros are core; their number is at least $n$ and at most $2n$ (note
that the size of a partial implicit representation is therefore independent of
$m$). The minimum (respectively, maximum) number of nonzeros is attained when
all (respectively, none of) these nonzeros are contained in the submatrix
defined by $(i, j) \in (0 : n) \times (0 : n)$.

Theorem 3. Given the partial implicit representation of a highest-score matrix
$A$, there exists a data structure which

• has size $O(n \log n)$;

• can be built in time $O(n \log n)$;

• allows to query an individual element of $A$, corresponding to an output of
the partial semi-local LCS problem, in time $O(\log^2 n)$.

Proof. Similarly to the proof of Theorem 2, the structure in question is a 2D
range tree built on the set of nonzeros in the partial implicit representation
of $A$. □

The following lemma gives an equivalent of highest-score matrix multiplication
for partially represented matrices.

Lemma 3. Consider the concatenation of alignment dags as described in
Subsection 4.2, with highest-score matrices $A$, $B$, $C$. Given the partial
implicit representations of $A$, $B$, the partial implicit representation of
$C$ can be computed in time $O(n^{1.5})$ and memory $O(n)$.

Proof. Let $D'_A(i, j) = D_A(i - M, j)$, $D'_B(j, k) = D_B(j, k + m)$,
$D'_C(i, k) = D_C(i - M, k + m)$ for all $i$, $j$, $k$, and define $d'_A$,
$d'_B$, $d'_C$ accordingly. It is easy to check that
$d'_A \odot d'_B = d'_C$, iff $d_A \odot d_B = d_C$. Matrices $D'_A$, $D'_B$,
$D'_C$ satisfy the conditions of Lemma 2, therefore all but $n$ of the core
nonzeros in the required partial implicit representation can be obtained by
(3)-(4) in time and memory $O(n)$, and the remaining $n$ core nonzeros in time
$O(n^{1.5})$ and memory $O(n)$. □



5 The algorithms 

5.1 Global subsequence recognition and LCS 

We now return to the problem of subsequence recognition introduced in
Section 2. A simple efficient algorithm for global subsequence recognition in
an SLP-compressed string is not difficult to obtain, and has been known in
folklore⁴. For convenience, we generalise the problem's output: instead of a
Boolean value, the algorithm will return an integer.

4 The author is grateful to Y. Lifshits for pointing this out.

Algorithm 1 (Global subsequence recognition).

Input: string $T$ of length $m$, represented by an SLP of length $\bar m$;
string $P$ of length $n$, represented explicitly.

Output: an integer $k$, giving the length of the longest prefix of $P$ that is
a subsequence of $T$. String $T$ contains $P$ as a subsequence, iff $k = n$.

Description. The computation is performed recursively as follows.

Let $T = T'T''$ be the SLP statement defining string $T$. Let $k'$ be the
length of the longest prefix of $P$ that is a subsequence of $T'$. Let $k''$ be
the length of the longest prefix of $P \downharpoonleft k'$ that is a
subsequence of $T''$. Both $k'$ and $k''$ can be found recursively. We have
$k = k' + k''$.

The base of the recursion is $m = \bar m = 1$. In this case, the value
$k \in \{0, 1\}$ is determined by a single character comparison.

Cost analysis. The running time of the algorithm is $O(\bar m k)$. The proof is
by induction. The running time of the recursive calls is respectively
$O(\bar m k')$ and $O(\bar m k'')$. The overall running time of the algorithm
is $O(\bar m k') + O(\bar m k'') + O(1) = O(\bar m k)$. In the worst case, this
is $O(\bar m n)$. □
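The recursion of Algorithm 1 can be sketched in Python as follows. This is one possible realisation under stated assumptions, not necessarily the exact folklore procedure: it tabulates, for every SLP symbol and every offset into the pattern, how many further pattern characters that symbol matches, which gives $O(\bar m n)$ time and also $O(\bar m n)$ memory. The SLP encoding (a list whose entries are either a character or a pair of 0-based indices of earlier statements) and the function name are hypothetical.

```python
from typing import List, Tuple, Union

# Assumed encoding: a character for T_r = alpha, or (s, t) for T_r = T_s T_t,
# where s and t are 0-based indices of earlier statements.
Statement = Union[str, Tuple[int, int]]


def longest_prefix_as_subsequence(slp: List[Statement], p: str) -> int:
    """Length k of the longest prefix of p that is a subsequence of the
    string represented by the last SLP statement."""
    n = len(p)
    # adv[r][k]: length of the longest prefix of p[k:] that is a subsequence
    # of T_r; the table has one row per statement and n + 1 columns.
    adv: List[List[int]] = []
    for stmt in slp:
        if isinstance(stmt, str):
            # base case: a single character comparison per offset
            row = [1 if k < n and p[k] == stmt else 0 for k in range(n + 1)]
        else:
            s, t = stmt
            # first match through T_s, then continue through T_t
            row = [adv[s][k] + adv[t][k + adv[s][k]] for k in range(n + 1)]
        adv.append(row)
    return adv[-1][0]


def contains_subsequence(slp: List[Statement], p: str) -> bool:
    """Global subsequence recognition: does T contain p as a subsequence?"""
    return longest_prefix_as_subsequence(slp, p) == len(p)
```

Tabulating over all offsets is a design choice made here for simplicity: it avoids re-evaluating a shared symbol for the same offset, at the price of storing the whole table.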

We now address the more general partial semi-local LCS problem. Our approach
is based on the technique introduced in Subsection 4.3.



Algorithm 2 (Partial semi-local LCS).

Input: string $T$ of length $m$, represented by an SLP of length $\bar m$;
string $P$ of length $n$, represented explicitly.

Output: the partial implicit highest-score matrix on strings $T$, $P$.

Description. The computation is performed recursively as follows.

Let $T = T'T''$ be the SLP statement defining string $T$. Given the partial
implicit highest-score matrices for each of $T'$ and $T''$ against $P$, the
partial implicit highest-score matrix of $T$ against $P$ can be computed by
Lemma 3.

The base of the recursion is $m = \bar m = 1$. In this case, the matrix
coincides with the full implicit highest-score matrix, and can be computed by a
simple scan of string $P$.

Cost analysis. By Lemma 3, each implicit matrix multiplication runs in time
$O(n^{1.5})$ and memory $O(n)$. There are $\bar m$ recursive steps in total,
therefore all the matrix multiplications combined run in time
$O(\bar m n^{1.5})$ and memory $O(n)$. □

Note that the above algorithm, as a special case, provides an efficient solution
for the LCS problem: the LCS score for $T$ against $P$ can easily be queried
from the algorithm's output by Theorem 3.

The running time of Algorithm 2 should be contrasted with standard uncompressed
LCS algorithms, running in time $O\bigl(\frac{mn}{\log(m+n)}\bigr)$ [12, 6],
and with the NP-hardness of the LCS problem on two compressed strings [11].



5.2 Local subsequence recognition 

We now show how the partial semi-local LCS algorithm of the previous section 
can be used to provide local subsequence recognition. 

Algorithm 3 (Minimal- window subsequence recognition). 

Input: string T of length m, represented by an SLP of length fh; string P of 

length n, represented explicitly. 



Output: the number of windows in T containing P minimally as a subsequence. 
Description. The algorithm runs in two phases. 

First phase. Using Algorithm [2j we compute the partial implicit highest-score 
matrix for every SLP symbol against P. For each of these matrices, we then 
build the data structure of Theorem [H 

Second phase. For brevity, we will call a window containing P minimally as a 
subsequence a P-episode window. The number of P-episode windows in T is 
computed recursively as follows. 

Let T = T'T" be the SLP statement defining string T. Let m', m" be the 
(uncompressed) lengths of strings T', T" . Let r' (respectively, r") be the number 
of P-episode windows in T' (respectively, T"), computed by recursion. 

We now need to consider the n — 1 possible prefix-suffix decompositions 
P = (P1 n'){P\n"), for all ri,n" > 0, such that ri + n" = n. Let I' (respec- 
tively, I") be the length of the shortest suffix of T" (respectively, prefix of T") 
containing P] n! (respectively, P \n") as a subsequence. The value of I' (respec- 
tively, I") can be found, or its non-existence established, by binary search on the 
first (respectively, second) index component of nonzeros in the partial implicit 
highest-score matrix of T' (respectively, T") against P. In every step of the bi- 
nary search, we make a suffix-prefix (respectively, prefix-suffix) LCS score query 
by Theorem [3] We call the interval \m! — V:m' + I"] a candidate window. 

It is easy to see that if a window in T is P-episode, then it is either contained 
within one of T', T" , or is a candidate window. Conversely, a candidate window 
is P-episode, unless there is a smaller candidate window ji], where either 
i = i\ < ji < j, or i < i\ < ji = j. Given the set of all candidate windows 
sorted separately by the lower endpoints and the higher cndpoints, this test can 
be performed in overall time 0{n). Let s be the resulting number of distinct 
P-episode candidate windows. The overall number of P-episode windows in T is 
equal to r' + r" + s. 

The base of the recursion is m < n. In this case, no windows of length n or 
more exist in T, so none can be P-episode. 

Cost analysis. 

First phase. As in Algorithm [2] the main data structure can be built in time 
0(fhn 15 ). The additional data structure of Theorem [2] can be built in time 
fh ■ O(nlogn) = 0(rhnlogn). 

Second phase. For each of n — 1 decompositions n' + n" — n, the binary search 
performs at most log n suffix-prefix and prefix-suffix LCS queries, each taking 
time 0(log 2 n). Therefore, each recursive step runs in time 2n-logn-0(log 2 n) = 
0(nlog 3 7i). There are fh recursive steps in total, therefore the whole recursion 
runs in time 0(rhn log 3 n). It is possible to speed up this phase by reusing data 
between different instances of binary search and LCS query; however, this is not 
necessary for the overall efficiency of the algorithm. 

The overall computation cost is dominated by the cost of building the main
data structure in the first phase, equal to $O(\bar{m} n^{1.5})$. □
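
Spelled out, the overall bound combines the three terms of the cost analysis; the polylogarithmic factors are absorbed because $\log^3 n$ grows more slowly than $n^{0.5}$:

```latex
\[
  \underbrace{O(\bar{m}\, n^{1.5})}_{\text{main data structure}}
  + \underbrace{O(\bar{m}\, n \log n)}_{\text{structure of Theorem 2}}
  + \underbrace{O(\bar{m}\, n \log^{3} n)}_{\text{recursion}}
  = O(\bar{m}\, n^{1.5}),
  \qquad \text{since } \log^{3} n = O(n^{0.5}).
\]
```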



Algorithm 4 (Fixed-window subsequence recognition).

Input: string T of length m, represented by an SLP of length $\bar{m}$; string P of
length n, represented explicitly; window length w.

Output: the number of windows of length w in T containing P as a subsequence.

Description.

First phase. As in Algorithm 3.

Second phase. For brevity, we will call a window of length w containing P as
a subsequence a (P, w)-episode window. The number of (P, w)-episode windows
in T is computed recursively as follows.

Let T = T'T'' be the SLP statement defining string T. Let m', m'' be the
(uncompressed) lengths of strings T', T''. Let r' (respectively, r'') be the number
of (P, w)-episode windows in T' (respectively, T''), computed by recursion.

We now need to consider the $w - 1$ windows that span the boundary between
T' and T'', each formed by the suffix of T' of length w' followed by the prefix
of T'' of length w'', for all w', w'' > 0 such that w' + w'' = w. We call an
interval $[m' - w' : m' + w'']$ a candidate window. In contrast with the
minimal-window problem, we can no longer afford to consider every candidate
window individually, and will therefore need to count them in groups of
"equivalent" windows.

Let (i, j) (respectively, (j, k)) be a nonzero in the partial highest-score matrix
of T' (respectively, T'') against P. We will say such a nonzero is covered by a
candidate window $[m' - w' : m' + w'']$, if $i \in (-m' : -m' + w')$ (respectively,
$k \in (m'' + n - w'' : m'' + n)$). We will say that two candidate windows are
equivalent, if they cover the same set of nonzeros both for T' and T''.

Since the number of nonzeros for each of T', T" is at most n, the defined 
equivalence relation has at most 2n equivalence classes. Each equivalence class 
corresponds to a contiguous segment of values w' (and, symmetrically, w"), and 
is completely described by the two endpoints of this segment. Given the set of 
all the nonzeros, the endpoint description of all the equivalence classes can be 
computed in time $O(n)$.
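
The grouping into contiguous segments can be sketched as follows. The encoding of nonzeros by integer thresholds is an assumption made for this illustration, not the indexing used in the text: a nonzero on the T' side with threshold a is taken to be covered iff w' > a, and one on the T'' side with threshold b iff w'' = w - w' > b. The function name is hypothetical, and the sketch sorts the breakpoints instead of relying on presorted nonzeros.

```python
def equivalence_classes(w, left_thresholds, right_thresholds):
    """Segments (lo, hi) of w' in 1..w-1 on which the covered set is constant.

    Assumed encoding (hypothetical): a T'-side nonzero with threshold a is
    covered iff w' > a; a T''-side nonzero with threshold b iff w - w' > b.
    """
    if w < 2:
        return []  # no window can span the boundary
    starts = {1}
    # The covered set grows on the T' side when w' reaches a + 1 ...
    starts.update(a + 1 for a in left_thresholds if 1 < a + 1 <= w - 1)
    # ... and shrinks on the T'' side when w' reaches w - b.
    starts.update(w - b for b in right_thresholds if 1 < w - b <= w - 1)
    ordered = sorted(starts)
    ends = [s - 1 for s in ordered[1:]] + [w - 1]
    return list(zip(ordered, ends))
```

For instance, with w = 10, thresholds 2 and 5 on the T' side and threshold 3 on the T'' side, the values of w' split into the four segments 1..2, 3..5, 6..6 and 7..9.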

For each equivalence class of candidate windows, either none or all of them
are (P, w)-episode; in the latter case, we will call the whole equivalence class
(P, w)-episode. We consider each equivalence class in turn, and pick from it an
arbitrary representative candidate window $[m' - w' : m' + w'']$. Let l'
(respectively, l'') be the length of the longest prefix (respectively, suffix) of P
contained in the suffix of T' of length w' (respectively, the prefix of T'' of
length w'') as a subsequence. The value of l' (respectively, l'') can be found by
binary search on the second (respectively, first) index component of nonzeros in
the partial implicit highest-score matrix of T' (respectively, T'') against P. In
every step of the binary search, we make a suffix-prefix (respectively,
prefix-suffix) LCS score query by Theorem 3.

It is easy to see that the current equivalence class is (P, w)-episode, iff
$l' + l'' \geq n$. Let s be the total size of (P, w)-episode equivalence classes.
The overall number of (P, w)-episode windows in T is equal to r' + r'' + s.
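
The recursion r' + r'' + s and the test l' + l'' ≥ n can be checked on small inputs with the following Python sketch. It is a reference illustration only, under assumptions not in the text: the SLP is modelled as nested pairs with explicit strings at the leaves, every subtree is expanded, and each boundary window is tested individually by greedy scanning, so it has none of the efficiency of the algorithm described here. All names are hypothetical.

```python
def longest_prefix_in(text, pat):
    """Length of the longest prefix of pat that is a subsequence of text."""
    k = 0
    for ch in text:
        if k < len(pat) and ch == pat[k]:
            k += 1
    return k


def longest_suffix_in(text, pat):
    """Length of the longest suffix of pat that is a subsequence of text."""
    return longest_prefix_in(text[::-1], pat[::-1])


def expand(node):
    """Uncompressed string of an SLP node: a str leaf or a pair (left, right)."""
    return node if isinstance(node, str) else expand(node[0]) + expand(node[1])


def count_fixed_windows(node, p, w):
    """Number of length-w windows of expand(node) containing p as a subsequence."""
    if isinstance(node, str):
        # Leaf: test every window directly.
        return sum(longest_prefix_in(node[i:i + w], p) == len(p)
                   for i in range(len(node) - w + 1))
    left, right = node
    t1, t2 = expand(left), expand(right)
    r1 = count_fixed_windows(left, p, w)
    r2 = count_fixed_windows(right, p, w)
    s = 0
    for w1 in range(1, w):  # windows spanning the boundary, w1 + w2 = w
        w2 = w - w1
        if w1 > len(t1) or w2 > len(t2):
            continue
        l1 = longest_prefix_in(t1[len(t1) - w1:], p)
        l2 = longest_suffix_in(t2[:w2], p)
        if l1 + l2 >= len(p):
            s += 1
    return r1 + r2 + s
```

For the nested pairs `(("ab", "c"), (("ab", "c"), "ab"))`, which expand to "abcabcab", the call with pattern "ac" and w = 3 counts the two windows "abc".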

The base of the recursion is m < w. In this case, no windows of length w or 
more exist in T, so none can be (P, w)-episode. 



Cost analysis. As in Algorithm 3, the total cost is dominated by the cost of
the first phase, equal to $O(\bar{m} n^{1.5})$. □



The bounded minimal-window subsequence recognition problem can be solved
by a simple modification of Algorithm 3, discarding all candidate windows of
length greater than w. Furthermore, in addition to counting the windows,
Algorithms 3 and 4 can both be easily modified to report all the respective
windows at the additional cost of O(output).

6 Conclusions 

We have considered several subsequence recognition problems for an SLP-compressed
text against an uncompressed pattern. First, we mentioned a simple folklore
algorithm for the global subsequence recognition problem, running in time
$O(\bar{m} n)$. Relying on the previously developed framework of semi-local string
comparison, we then gave an algorithm for the partial semi-local LCS problem,
running in time $O(\bar{m} n^{1.5})$; this includes the LCS problem as a special
case. A natural question is whether the running time of partial semi-local LCS
(or just LCS) can be improved to match global subsequence recognition.

We have also given algorithms for the local subsequence recognition problem 
in its minimal-window and fixed-window versions. Both algorithms run in time 
$O(\bar{m} n^{1.5})$, and can be easily modified to report all the respective windows at the
additional cost of O(output). Again, a natural question is whether this running 
time can be further improved. 

Another classical generalisation of both the LCS problem and local subsequence
recognition is approximate matching (see e.g. [14]). Here, we look for substrings
in the text that are close to the pattern in terms of the edit distance, with
possibly different costs charged for insertions/deletions and substitutions. Once
again, we can formulate it as a counting problem (the k-approximate matching
problem): counting the number of windows in T that have edit distance at most
k from P. This problem is considered on LZ-compressed strings (essentially, a
special case of SLP-compression) in paper [TU], which gives an algorithm running
in time $O(\bar{m} n k)$. It would be interesting to see if this algorithm can be
improved by using the ideas of the current paper.